01_water_quality_data
Read in raw data
The raw data is from Water Quality Data in NYC.
EDA and Data Filtering
All variables:
## [1] "Sample.Number" "Sample.Date"
## [3] "Sample.Time" "Sample.Site"
## [5] "Sample.class" "Residual.Free.Chlorine..mg.L."
## [7] "Turbidity..NTU." "Fluoride..mg.L."
## [9] "Coliform..Quanti.Tray...MPN..100mL." "E.coli.Quanti.Tray...MPN.100mL."
The water quality data contain variables as below (Since we don’t have detailed explanation from the website, all explanation is based on my own understanding):
Sample level:
- Sample Number: A unique identifier assigned to each water sample.
- Sample Date: The date when the water sample was collected. It is usually formatted as MM/DD/YYYY.
- Sample Time: The time when the water sample was taken (timestamp format (ISO 8601)), providing precise timing. Here we don’t need such precise timing.
- Sample Site: The identifier that indicates the location where the
sample was collected. We can map this site to
sample_siteto get the accurate coordinate. - Sample Class: The classification of the sample. For example, “Compliance” indicates that these samples are collected as part of the regulatory monitoring program, while “Operational” help operators manage and maintain system performance, optimize treatment processes, and monitor system conditions. Here we don’t need such precise classification.
Water quality measurement:
- Residual Free Chlorine (mg/L): Free chlorine is used to maintain a disinfectant residual throughout the distribution system.
- Turbidity (NTU): Turbidity is a measure of water clarity. It indicates the amount of suspended particles in the water, which can impact both the appearance and the safety of the water.
- Fluoride (mg/L): Fluoride is often added to public water supplies to help prevent tooth decay.
- Coliform (Quanti-Tray) (MPN/100mL): Coliform bacteria are used as indicators of general water quality and sanitary conditions.
- E.coli (Quanti-Tray) (MPN/100mL): Similar to the coliform count. E. coli is a specific fecal indicator; its presence typically signals fecal contamination, which is a significant health concern.
sample date
water_quality <- water_quality %>%
mutate(date = as.Date(Sample.Date, format = "%m/%d/%Y")) %>%
mutate(year_month = format(date, "%Y-%m")) %>%
mutate(year = format(date, "%Y")) %>%
mutate(month = format(date, "%m")) water_quality %>%
group_by(year, month) %>%
summarise(count = n()) %>%
ungroup() %>%
ggplot(aes(x = month, y = year, fill = count)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(title = "Heatmap of Year-Month Combination Count", x = "Month", y = "Year") +
theme_minimal()We can find almost all the time we have the measurement, making it possible to do a time scale analysis.
sample site
We first do the deduplication of the reference sample site file. (Sample Site 39550)
names(sample_site)[c(1,3,4)] = c("Sample.Site","X","Y")
sample_site <- sample_site %>%
distinct(Sample.Site, .keep_all = TRUE)We first map the provided coordinates to the data and remove the NA values.
water_quality <- water_quality %>%
left_join(sample_site %>%
select("Sample.Site", "X", "Y"),
by = "Sample.Site")
water_quality <- water_quality %>%
filter(!is.na(X) & !is.na(Y))The provided sample coordinate seems to be encoded in
EPSG:2263 format. We first try to change this into the
latitude and longitude based encoding
WGS84(EPSG:4326) to make our visualization
easier.
water_quality <- st_as_sf(water_quality, coords = c("X", "Y"), crs = 2263)
water_quality$geometry <- st_transform(water_quality$geometry, 4326)From the changed geometry, we can find they are roughly consistent with the latitude and longitude of NYC, so we would expect the transformation is working.
## Geometry set for 151215 features
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -74.24467 ymin: 40.50231 xmax: -73.71687 ymax: 40.90798
## Geodetic CRS: WGS 84
## First 5 geometries:
We can try to visualize it on the map. Here we get the NYC neighborhood data from official website. The visualization also shows the coordinate mapping is satisfactory.
## Reading layer `nynta2020' from data source
## `/Users/dl3738/Documents/Work/Course/Data Visualization/Group_F_TBD/data/raw_data/nynta2020_25a/nynta2020.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 262 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 913175.1 ymin: 120128.4 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
## Simple feature collection with 262 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -74.25559 ymin: 40.49613 xmax: -73.70001 ymax: 40.91553
## Geodetic CRS: WGS 84
## First 10 features:
## BoroCode BoroName CountyFIPS NTA2020 NTAName
## 1 3 Brooklyn 047 BK0101 Greenpoint
## 2 3 Brooklyn 047 BK0102 Williamsburg
## 3 3 Brooklyn 047 BK0103 South Williamsburg
## 4 3 Brooklyn 047 BK0104 East Williamsburg
## 5 3 Brooklyn 047 BK0201 Brooklyn Heights
## 6 3 Brooklyn 047 BK0202 Downtown Brooklyn-DUMBO-Boerum Hill
## 7 3 Brooklyn 047 BK0203 Fort Greene
## 8 3 Brooklyn 047 BK0204 Clinton Hill
## 9 3 Brooklyn 047 BK0261 Brooklyn Navy Yard
## 10 3 Brooklyn 047 BK0301 Bedford-Stuyvesant (West)
## NTAAbbrev NTAType CDTA2020
## 1 Grnpt 0 BK01
## 2 Wllmsbrg 0 BK01
## 3 SWllmsbrg 0 BK01
## 4 EWllmsbrg 0 BK01
## 5 BkHts 0 BK02
## 6 DwntwnBk 0 BK02
## 7 FtGrn 0 BK02
## 8 ClntnHl 0 BK02
## 9 BkNvyYrd 6 BK02
## 10 BdSty_W 0 BK03
## CDTAName Shape_Leng
## 1 BK01 Williamsburg-Greenpoint (CD 1 Equivalent) 28919.56
## 2 BK01 Williamsburg-Greenpoint (CD 1 Equivalent) 28098.03
## 3 BK01 Williamsburg-Greenpoint (CD 1 Equivalent) 18250.28
## 4 BK01 Williamsburg-Greenpoint (CD 1 Equivalent) 43184.80
## 5 BK02 Downtown Brooklyn-Fort Greene (CD 2 Approximation) 14312.51
## 6 BK02 Downtown Brooklyn-Fort Greene (CD 2 Approximation) 30598.67
## 7 BK02 Downtown Brooklyn-Fort Greene (CD 2 Approximation) 23284.62
## 8 BK02 Downtown Brooklyn-Fort Greene (CD 2 Approximation) 18102.97
## 9 BK02 Downtown Brooklyn-Fort Greene (CD 2 Approximation) 39415.64
## 10 BK03 Bedford-Stuyvesant (CD 3 Approximation) 26307.65
## Shape_Area geometry
## 1 35321810 MULTIPOLYGON (((-73.93213 4...
## 2 28854314 MULTIPOLYGON (((-73.95814 4...
## 3 15208961 MULTIPOLYGON (((-73.95024 4...
## 4 52267408 MULTIPOLYGON (((-73.92406 4...
## 5 9982321 MULTIPOLYGON (((-73.99236 4...
## 6 23732978 MULTIPOLYGON (((-73.97906 4...
## 7 17534134 MULTIPOLYGON (((-73.96284 4...
## 8 14566585 MULTIPOLYGON (((-73.96135 4...
## 9 10106807 MULTIPOLYGON (((-73.97893 4...
## 10 36906699 MULTIPOLYGON (((-73.94414 4...
Chlorine
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -9.9900 0.4200 0.5700 0.5624 0.7100 2.2000 49
Turbidity
Extreme value: <0.10
Here for easy analysis, we directly change these values to 0.10.
water_quality <- water_quality %>%
mutate(Turbidity=gsub("[<>]", "", Turbidity)) %>%
mutate(Turbidity=as.numeric(Turbidity)) %>%
filter(!is.na(Turbidity) & Turbidity >= 0)## [1] 332
## [1] 0.002196334
water_quality %>%
filter(Turbidity <= 1.5) %>%
ggplot(aes(x=Turbidity)) +
geom_histogram(binwidth = 0.1)Fluoride
## [1] 19961
Since the NA value for Fluoride is too much and this variable is not very correlated with water quality, we can just discard it.
Coliform
Extreme value: <1, >200.5
water_quality <- water_quality %>%
mutate(Coliform=gsub("[<>]", "", Coliform)) %>%
mutate(Coliform=as.numeric(Coliform)) %>%
filter(!is.na(Coliform) & Coliform >= 0)## [1] 416
## [1] 0.002753253
Ecoli
Extreme value: <1
water_quality <- water_quality %>%
mutate(Ecoli=gsub("[<>]", "", Ecoli)) %>%
mutate(Ecoli=as.numeric(Ecoli)) %>%
filter(!is.na(Ecoli) & Ecoli >= 0)## [1] 1
## [1] 6.618396e-06
Variant selection
coordinates <- st_coordinates(water_quality)
water_quality <- water_quality %>%
mutate(longitude = coordinates[,1]) %>%
mutate(latitude = coordinates[,2])
water_quality <- water_quality %>%
select(Sample.Number, year_month, year, month, Residual_Chlorine, Turbidity, longitude, latitude) %>%
st_drop_geometry()Sample plot
Geometric mapping
We can change the time scale to see different patterns.
plot_data <- water_quality %>%
filter(as.character(year_month) == time) %>%
select(all_of(c(variant, "geometry"))) %>%
group_by(geometry) %>%
summarise(mean_value = mean(.data[[variant]], na.rm = TRUE))
pal <- colorNumeric(palette = "YlOrRd", domain = plot_data[["mean_value"]])
leaflet() %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addPolygons(data = neighborhoods, color = "#444", weight = 1, label = ~NTAName) %>%
addCircleMarkers(
data = plot_data,
radius = 5,
color = pal(plot_data[["mean_value"]]),
fillOpacity = 0.8,
stroke = FALSE,
popup = ~paste(variant, ": ", mean_value)
) %>%
addLegend(
pal = pal,
values = plot_data[["mean_value"]],
title = variant,
position = "bottomright"
)Trend along time scale
time_start <- ym("2023-01")
time_end <- ym("2024-12")
variant <- "Residual_Chlorine"
neighbourhood <- "Harlem (South)"sub_neighbourhood <- neighborhoods %>%
filter(NTAName == neighbourhood)
plot_data <- water_quality %>%
filter(ym(year_month) >= time_start & ym(year_month) <= time_end) %>%
select(all_of(c(variant, "geometry", "year_month"))) %>%
group_by(geometry, year_month) %>%
summarise(mean_value = mean(.data[[variant]], na.rm = TRUE)) %>%
ungroup()
if (neighbourhood != "Whole NYC")
plot_data <- st_filter(plot_data, sub_neighbourhood)
plot_data %>%
ggplot(aes(x=year_month, y=mean_value, group=1)) +
geom_smooth(method = "loess", se=TRUE) +
labs(
title = paste(variant, "change along time in", neighbourhood),
x = "Time",
y = variant
) +
theme(
plot.title = element_text(face = "bold", size = 13),
axis.text.y = element_text(face = "bold"),
axis.text.x = element_text(face = "bold", angle = 45, hjust = 1),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
panel.grid = element_blank(),
panel.background = element_blank(),
plot.background = element_blank(),
axis.line = element_line(color = "black")
)